We propose an implementation of the MPI standard suitable for HPEC applications.

The HPEC requirements addressed by the proposed implementation include:

• support of "zero copy" transport mechanisms
• memory management
• handling heterogeneous communication topologies

Zero copy communications are key for HPEC, in order to allocate memory bandwidth to actual
communications rather than to data transfers between buffers. The semantics and
implementation issues associated with the mapping of MPI primitives onto zero copy mechanisms
such as VIA [1] or RDMA [2] have been discussed [3, 4, 5], and the feasibility and performance
benefits have been demonstrated on the basis of popular MPI implementations such as LAM [6]
or MPICH [7].

Optimised, industrial MPI implementations based on zero copy transport should be provided on
the high-performance, low-latency media suitable for HPEC systems. The implementation
described in this paper uses a serial RapidIO network across PowerPC computing nodes. The
applicability to InfiniBand or Gigabit Ethernet is also discussed.

Zero copy communication implies constraints on memory allocation and buffer management;
memory buffer management is neither covered nor precluded by the MPI standard. Defining an
API that allows the needed level of memory buffer management while preserving compatibility with
the MPI standard is one of the issues addressed by this implementation.

Finally, the paper proposes a mechanism, compatible with the MPI standard, to describe the
communication topology of HPEC software such as a dataflow signal processing application.
Such an application is described as a set of tasks interconnected by virtual links. The
application graph, and its mapping onto a possibly heterogeneous computing and communication
infrastructure, is defined through an independent configuration file, so the application
deployment can be modified without recompilation. The translation of the application graph into
MPI groups and communicators is carried out at run time and is transparent to the user.

- Middleware Libraries and Application Programming Interfaces
- Software Architectures, Reusability, Scalability, and Standards

Heterogeneous HPEC systems

Systems used for dataflow applications

• Computing power requirements not evenly spread
• Various transport media may coexist
• Need for QoS-type behaviour
• Performance requirement for I/O between nodes

Requirements

• Need to map processes to computing nodes
• Need to select a specific link between processes
• Need to implement a zero-copy feature

Using MPI in HPEC

PROS

• Available on almost every parallel/cluster machine
• Ensures application code portability

CONS

• Made for collective parallel apps, not distributed apps
• No choice of communication interface (only the receiver is known)
• Does not care about transport medium
• No control over timeouts
• Not a communication library (no dynamic connection, no select feature)

Zero-copy Requirements

Zero-copy means memory management

• Same memory buffer used by application and I/O system
• At any given time, a buffer must belong to the application OR the I/O system

Zero-copy API

Buffer Get
• Data buffer is now part of application data
• Can be used as any private memory

Buffer Release
• Data buffer must not be modified by the application any more
• Can be used by the I/O system (likely hardware DMA)

Implementation choice

MPI Services (MPS) side by side with MPI

• MPI application source portability
• Links/Connector relationship
• Real-time support
    - Links to select communication channels (~ QoS)
    - Request timeout support
• Real zero-copy transfer
    - Buffer Management API (MPS)
• Heterogeneous machine support
    - Topology files outside the application

[Figure: MPS, MPI and COM layers]

Dedicated MPI Communicator for Zero-copy Link

[Figure: MPI_COMM_WORLD]

HPEC System Topology Description

System topology described outside the application code
External ASCII files with:

• Process
    - Process name
    - Process hardware location (bo...
• Link
    - Link name
    - Medium type (+ medium-specific parameters)
    - Buffer size
    - Buffer count

[Figure: Proc A and Proc B joined by a VME link and a RIO link]

THALES  COMPUTERS 


HPEC  2004  Poster  C5:  Optimised  MPI  for  HPEC  Applications 


MPS API: processes and links

MPS_Channel_create(*chan_name, *rendpoint, MPI_Comm *comm, int *lrank, int *rrank);
    chan_name  (in)   link name
    rendpoint  (in)   remote end name
    comm       (out)  specific communicator for the link
    lrank      (out)  my rank in new communicator
    rrank      (out)  remote end rank in new communicator

MPS_Process_get_name(int rank, char *name);
    rank       (in)   rank in MPI_COMM_WORLD
    name       (out)  my name in link/process file

MPS_Process_get_rank(char *name, int *rank);
    name       (in)   name in link/process file
    rank       (out)  my rank in MPI_COMM_WORLD
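
To illustrate the name/rank lookups and the ranks returned for a link, here is a minimal, MPI-free sketch in C. It assumes a hypothetical process table indexed by rank in MPI_COMM_WORLD, and it assumes that the two members of a link communicator are ordered by world rank; the source does not state how MPS actually assigns ranks. The functions `process_get_rank`, `process_get_name` and `link_ranks` are illustrative stand-ins, not the MPS functions themselves.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical process table: index = rank in MPI_COMM_WORLD. */
static const char *const proc_names[] = { "proc1", "proc2", "proc3" };
enum { NPROCS = sizeof proc_names / sizeof proc_names[0] };

/* Name -> world rank. Returns 0 on success, -1 if the name is unknown. */
static int process_get_rank(const char *name, int *rank)
{
    for (int i = 0; i < NPROCS; i++) {
        if (strcmp(proc_names[i], name) == 0) {
            *rank = i;
            return 0;
        }
    }
    return -1;
}

/* World rank -> name. Returns 0 on success, -1 if the rank is out of range. */
static int process_get_name(int rank, const char **name)
{
    if (rank < 0 || rank >= NPROCS)
        return -1;
    *name = proc_names[rank];
    return 0;
}

/* Local and remote ranks inside a two-member link communicator.
 * Assumption: the member with the lower world rank gets rank 0. */
static int link_ranks(int my_world_rank, const char *remote_name,
                      int *lrank, int *rrank)
{
    int remote_world_rank;

    if (process_get_rank(remote_name, &remote_world_rank) != 0 ||
        remote_world_rank == my_world_rank)
        return -1;
    *lrank = (my_world_rank < remote_world_rank) ? 0 : 1;
    *rrank = 1 - *lrank;
    return 0;
}
```

With this table, a process with world rank 0 that opens a link to "proc2" would see itself as rank 0 and its peer as rank 1 in the link communicator, while "proc3" opening a link to "proc2" would see the opposite ordering.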

MPS API: Buffers

MPS_Buf_pool_init(MPI_Comm com, way, *p_bufsize, *p_bufcount, *p_mps_pool)
    com                    (in)   MPI communicator
    way                    (in)   Send or Receive
    p_bufsize, p_bufcount  (out)  buffer size & count
    p_mps_pool             (out)  MPS pool handle

MPS_Buf_get(p_mps_pool, void **p_buffer)
    get buffer from pool (may block, or return EEMPTY)

MPS_Buf_release(p_mps_pool, void *buffer)
    give buffer to I/O system (compulsory at each use)

MPS_Buf_pool_finalize(p_mps_pool)
    free all buffers; all communications must have completed first
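
The ownership rule behind these calls (a buffer belongs either to the application or to the I/O system) can be sketched as a small fixed-size pool in plain C. This is only an illustration under stated assumptions: the names `buf_pool`, `pool_get`, `pool_release`, `pool_io_done` and `POOL_EEMPTY` are hypothetical, the get is non-blocking, and the hand-back from the I/O side is simulated by an explicit call rather than by DMA completion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define POOL_EEMPTY (-2)   /* hypothetical stand-in for the EEMPTY result */

typedef enum { BUF_FREE = 0, BUF_APP, BUF_IO } buf_owner;

typedef struct {
    size_t     buf_size;
    size_t     buf_count;
    char      *storage;    /* buf_count contiguous buffers */
    buf_owner *owner;      /* current owner of each buffer */
} buf_pool;

static int pool_init(buf_pool *p, size_t buf_size, size_t buf_count)
{
    if (buf_size == 0 || buf_count == 0)
        return -1;
    p->buf_size  = buf_size;
    p->buf_count = buf_count;
    p->storage   = malloc(buf_size * buf_count);
    p->owner     = calloc(buf_count, sizeof *p->owner); /* all BUF_FREE */
    if (!p->storage || !p->owner) {
        free(p->storage);
        free(p->owner);
        return -1;
    }
    return 0;
}

/* Find the index of a buffer address; -1 if it is not from this pool. */
static int pool_index(const buf_pool *p, const void *buf, size_t *idx)
{
    for (size_t i = 0; i < p->buf_count; i++) {
        if (buf == (const void *)(p->storage + i * p->buf_size)) {
            *idx = i;
            return 0;
        }
    }
    return -1;
}

/* Non-blocking get: the application takes ownership of a free buffer. */
static int pool_get(buf_pool *p, void **buf)
{
    for (size_t i = 0; i < p->buf_count; i++) {
        if (p->owner[i] == BUF_FREE) {
            p->owner[i] = BUF_APP;
            *buf = p->storage + i * p->buf_size;
            return 0;
        }
    }
    return POOL_EEMPTY;
}

/* Release: the application gives the buffer to the I/O side. */
static int pool_release(buf_pool *p, void *buf)
{
    size_t i;
    if (pool_index(p, buf, &i) != 0 || p->owner[i] != BUF_APP)
        return -1;
    p->owner[i] = BUF_IO;
    return 0;
}

/* Simulated I/O completion: the buffer becomes free again. */
static int pool_io_done(buf_pool *p, void *buf)
{
    size_t i;
    if (pool_index(p, buf, &i) != 0 || p->owner[i] != BUF_IO)
        return -1;
    p->owner[i] = BUF_FREE;
    return 0;
}

static void pool_finalize(buf_pool *p)
{
    free(p->storage);
    free(p->owner);
    p->storage = NULL;
    p->owner   = NULL;
}
```

In this model a released buffer cannot be released twice or handed out again until the I/O side has finished with it, which is the property a zero-copy transport relies on.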

MPI/MPS example

    MPI_Init(&argc, &argv);

    /* Create dedicated link, get specific connector */
    MPS_Channel_create("link1", "proc2", &com, &lrank, &rrank);

    /* Initialize memory pool */
    MPS_Buf_pool_init(com, (sender) ? MPS_SND : MPS_RCV, &bufsize, &bufcount, &pool);

    if (sender) {
        MPS_Buf_get(pool, &buf);        /* Take buffer ownership */

        /* Fill in with data */

        /* Send on connector */
        MPI_Isend(buf, size/sizeof(int), MPI_INT, rrank, 99, com, &req);
        MPI_Wait(&req, &status);

        MPS_Buf_release(pool, buf);     /* Release buffer */
    } else {
        /* receiver side not shown */
    }

    MPS_Buf_pool_finalize(pool);
    MPI_Finalize();

Portability

MPI application easily ported to MPI/MPS API
• See example

MPI/MPS application can run on any platform: EMPS
• EMPS is an MPS emulation on top of standard MPI communications
• Allows MPI/MPS code to run unmodified
• Includes buffer and link management

[Figure: MPI/MPS Application, libemps.a, libmpi.a, Topology files]

Current Implementation

Based on MPICH (version not specified)

Software
• IA32 Red Hat, PowerPC LynxOS 4.0

HW Targets
• PC, Thales multiprocessor VME boards

Multi-protocol support in COM layer
• DDlink: Direct Deposit zero-copy layer
    - Fibre Channel RDMA, Shared Memory, VME 2eSST, RapidIO
• Standard Unix/POSIX I/O
    - Shared Memory, TCP/IP

Current Work

Finalize process mapping
• MPI_RUN and HPEC compatible process mapping

Towards automatic code generation
• Create MPS/MPI code from HPEC application tools

More support for MPI-aware debug tools
• Like TotalView™

Thank you

vincent.chuffart@thalescomputers.fr